Skip to content
This repository was archived by the owner on Jul 1, 2025. It is now read-only.

Conversation

beicy
Copy link
Contributor

@beicy beicy commented Jul 16, 2019

Summary:
This PR added user-defined partition flow. Basically, a struct "PartitionConfig" containing the partition info is passed into Partitioner to enable this flow. Now we let users have the full control of how to do the partitioning.
To use this flow, users can write their helper function to generate PartitionConfig, and call Partitioner directly.
In the following PR, we will add passing PartitionConfig through HostManager from a yaml file.

Related to #2298
Documentation:

[Optional Fixes #issue]

Test Plan:
Added unittest. ninja test.

Please see a detailed explanation of how to fill out the fields in the relevant sections in PULL_REQUEST.md.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like fstream is used. (If it is, put it with the other #includes above.

Copy link
Contributor

@bertmaher bertmaher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have a handful of nits before landing. Thanks for prioritizing this :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this go before selectRepFunc (and maybe before getBackendMap)? It seems like it would circumvent the whole rest of the partitioning process so it'd be good to not do any irrelevant work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do it in front of selectRepFunc. However, we do need to run getBackendMap for later partition validation and logging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only semi-related to this diff but I don't think (void)subF; is needed anymore, since subF is legitimately used below, even in Debug mode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sp: Paritition -> Partition

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: sp and inconsistent caps. // Validate memory usage. would read better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sp: validation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sp: validation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// This is struct is for user defined partition.
/// This is struct for user defined partition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// The name of the functiton to be partitioned.
/// The name of the function to be partitioned.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// The number of user defined partitions. The partition ids are [0,
/// The number of user defined partitions.
/// The partition ids are between 0 and numOfPartitions - 1, inclusive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
llvm::StringMap<size_t> nodeToParitition;
llvm::StringMap<size_t> nodeToPartition;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a map between node's name and unsigned partition number? It wasn't clear to me initially that name is used here. Also, need to clarify that Glow mangles names, meaning that Glow's node name is never the same as name from Caffe2 protobuf.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::vector<NodesSet> nodesSets;
std::vector<NodesSet> nodesSets(partitionConfig_.numOfPartitions);

Then you won't need nodesSets.push_back(NodesSet{}); later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// If there is un used partition and un-mapped nodes, map those nodes to the
// If there is unused partition and unmapped nodes, map those nodes to the

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check that partitionID is < partitionConfig_.numOfPartitions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// un-used partition.
// unused partition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert(unUsed.size() && "Missing node to partition mapping.");
assert(unUsed.size() == 1 && "There must be exactly 1 unused partition.");

@beicy beicy force-pushed the auto1 branch 2 times, most recently from bfb931b to c21d412 Compare July 16, 2019 23:14
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doxygen style and above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is backendNames.size() == numOfPartitions. Why do we need numOfPartitions at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means backendNames.size() should be equal to numOfPartitions. Otherwise, there is an error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just have backendNames alone and the size tells how many partitions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also works. The reason I added this separate field is: I think it is clear for people just use a number directly as a reference. The we compare the size of partition names and backends in case something is missing by accident.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please, add function name to the error message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused.

This comment was marked as resolved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no asserts. DCHECK, add partition id to the error message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: partitionFromConfig

Copy link

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@beicy has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

// we need the mapping between quantized tensor and original tensor.
FunctionToBackendNameMap funcToBackend;
funcToBackend = backendBasedPartition(F_, backends, cctx);
module_->eraseFunction(F_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that the function can be renamed and will no longer match the name in partitionConfig?

Copy link
Contributor Author

@beicy beicy Jul 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. If partitionConfig is set, Partitioner will go to user-defined flow without considering other options like "-load-profiling". So later renaming in the rest flow won't affect the function name.

/// mapped to this partition id are not in this map, and the nodes mapped to
/// other partitions ids must be in this map. The node's name should be the
/// name in Glow function and may be different from the original name from
/// models. Since Glow will mangle names to make them unique.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not only to make them unique. We also e.g. replace / with _. Maybe something else.
You may write: Since Glow will mangle names according to its own conventions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see that it's already bad API. Let's say caller 1 has only partitionConfig. But now it has to pass false, false for some other 2 flags. Caller 1 has to figure out what is saturation, and what to pass for corresponding flag to disable.

It would be a good practice to make sure that incompatible flags are not passed together. E.g. saturation=false when user defined partition. But these checks can quickly explode if we have more flags/modes.

What if we later introduce another feature, third mode for partitioner? Say, PyTorch-defined partition? Then caller 3 will have to figure out what to pass for partitionConfig argument to disable user-defined partition feature.

This is why I think we should never accept a whole all-inclusive list of args in one function (in this case, partitioner constructor). I was going to propose create different method for user defined partition. But if you are going to refactor everything later into different classes, I'm also fine with that.

partitionMap.add(node, funcList[partitionID]);
nodesSets[partitionID].insert(node);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else {
what if there are unused partitions? Meaning that some partitions are empty. Is it valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can be at most 1 unused partition (there is the check). Otherwise, it is invalid. Since it means we have empty functions in the module and will cause problems later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will cause problems later

It is better to nail these problems early. Here we still can report what exactly is wrong: no nodes were mapped to a partitions #i, #j, #k... Later it will be difficult to track down where the empty function is coming from. In fact, it may fail with obscure error, so first debugging step would be to figure out that the function is empty, and only after that find out why it's empty. I really suggest to report as many problems with the partition as we can find.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. So far, if there is empty function, it will cause the assertion and stop. But you are right, the error message is not enough. Will improve the debugging ability. Thanks!

Copy link
Contributor

@gcatron gcatron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Great work!

@beicy
Copy link
Contributor Author

beicy commented Jul 17, 2019

@artemrakhov Thanks a lot for the suggestion on the interface. Let's have more discussion about the interface offline.

@facebook-github-bot
Copy link

@beicy merged this pull request in 809a3e8.

@beicy beicy deleted the auto1 branch July 25, 2019 22:08
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants